Some Methods for Classification and Analysis of Multivariate Observations

Author

  • J. MACQUEEN
Abstract

The main purpose of this paper is to describe a process for partitioning an N-dimensional population into k sets on the basis of a sample. The process, which is called 'k-means,' appears to give partitions which are reasonably efficient in the sense of within-class variance. That is, if $p$ is the probability mass function for the population, $S = \{S_1, S_2, \ldots, S_k\}$ is a partition of $E_N$, and $u_i$, $i = 1, 2, \ldots, k$, is the conditional mean of $p$ over the set $S_i$, then $W^2(S) = \sum_{i=1}^{k} \int_{S_i} |z - u_i|^2 \, dp(z)$ tends to be low for the partitions $S$ generated by the method. We say 'tends to be low,' primarily because of intuitive considerations, corroborated to some extent by mathematical analysis and practical computational experience. Also, the k-means procedure is easily programmed and is computationally economical, so that it is feasible to process very large samples on a digital computer. Possible applications include methods for similarity grouping, nonlinear prediction, approximating multivariate distributions, and nonparametric tests for independence among several variables. In addition to suggesting practical classification methods, the study of k-means has proved to be theoretically interesting. The k-means concept represents a generalization of the ordinary sample mean, and one is naturally led to study the pertinent asymptotic behavior, the object being to establish some sort of law of large numbers for the k-means. This problem is sufficiently interesting, in fact, for us to devote a good portion of this paper to it. The k-means are defined in section 2.1, and the main results which have been obtained on the asymptotic behavior are given there. The rest of section 2 is devoted to the proofs of these results. Section 3 describes several specific possible applications, and reports some preliminary results from computer experiments conducted to explore the possibilities inherent in the k-means idea. The extension to general metric spaces is indicated briefly in section 4.
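The abstract names the procedure and the within-class variance criterion but does not spell out the steps, which the paper defers to section 2.1. As a rough illustration only, the sketch below assumes a one-pass scheme: the first k sample points seed the means, and each later point is assigned to its nearest mean, which is then updated incrementally. It then evaluates a sample analogue of the within-class variance, using the empirical distribution in place of p. The function names (`sequential_k_means`, `assign_to_nearest`, `within_class_variance`) and the toy data are hypothetical and do not come from the paper.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two points given as sequences."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def sequential_k_means(points, k):
    """One-pass sketch (assumed scheme): seed with the first k points,
    then move the nearest mean toward each subsequent point."""
    means = [[float(x) for x in p] for p in points[:k]]
    counts = [1] * k  # each seed mean starts as the mean of one point
    for z in points[k:]:
        i = min(range(k), key=lambda j: squared_distance(z, means[j]))
        counts[i] += 1
        # Incremental mean update: new_mean = old_mean + (z - old_mean) / n
        means[i] = [m + (x - m) / counts[i] for m, x in zip(means[i], z)]
    return means


def assign_to_nearest(points, means):
    """Label each point with the index of its nearest mean."""
    return [
        min(range(len(means)), key=lambda j: squared_distance(z, means[j]))
        for z in points
    ]


def within_class_variance(points, labels, k):
    """Sample analogue of the within-class variance: average squared
    distance from each point to the mean of its own class."""
    total = 0.0
    for i in range(k):
        members = [z for z, lab in zip(points, labels) if lab == i]
        if not members:
            continue  # an empty class contributes nothing
        dim = len(members[0])
        u_i = [sum(z[d] for z in members) / len(members) for d in range(dim)]
        total += sum(squared_distance(z, u_i) for z in members)
    return total / len(points)


# Toy data (hypothetical): two well-separated groups in the plane.
sample = [(0.0, 0.0), (10.0, 10.0), (0.0, 1.0),
          (10.0, 11.0), (1.0, 0.0), (11.0, 10.0)]
means = sequential_k_means(sample, 2)
labels = assign_to_nearest(sample, means)
w2 = within_class_variance(sample, labels, 2)
print(labels, w2)
```

Because each point is touched once and only one mean is updated per point, a scheme of this kind needs memory for just the k means and their counts, which is in keeping with the abstract's remark that very large samples are feasible to process.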
The original point of departure for the work described here was a series of problems in optimal classification (MacQueen [9]) which represented special

Similar resources

Application of multivariate techniques in-line with spatial regionalization of AOD over Iran

Introduction: Models, satellites and terrestrial datasets have been used to detect and characterize aerosol. Nonetheless, microscale classification using remote sensing parameters is considered a deficiency. Thus, regionalization and modeling aerosol without regard to political boundaries or a specific s...

Study of the metabolic profile of Papaver extracts by chromatographic and chemometrics methods

Background and objectives: Chromatography fingerprinting is considered a comprehensive method for quality control, diagnosis and the nature of herbal drugs, and it is important to classify the different samples of medicinal plants and determine the chemical species present in them. Methods: In this research, a new strategy based on the combination of multiva...

On Model-Based Clustering, Classification, and Discriminant Analysis

The use of mixture models for clustering and classification has burgeoned into an important subfield of multivariate analysis. These approaches have been around for a half-century or so, with significant activity in the area over the past decade. The primary focus of this paper is to review work in model-based clustering, classification, and discriminant analysis, with particular attenti...

Linear and Nonlinear Multivariate Classification of Iranian Bottled Mineral Waters According to Their Elemental Content Determined by ICP-OES

The combinations of inductively coupled plasma-optical emission spectrometry (ICP-OES) and three classification algorithms, i.e., partial least squares discriminant analysis (PLS-DA), least squares support vector machine (LS-SVM) and soft independent modeling of class analogies (SIMCA), for discriminating different brands of Iranian bottled mineral waters, were explored. ICP-OES was used for th...

Outlier test for a group of multivariate observations

Assume that we have m independent random samples, each of size n, from $N_p(\mu, \Sigma)$, and our goal is to test whether or not the ith sample is an outlier (i = 1, 2, …, m). To date it is well known that a test statistic exists whose null distribution is beta, and given the relationship between the beta and F distributions, an F test statistic can be used. In the statistical literature, however, a clear and preci...

Bioclimatic zoning of Chaharmahal and Bakhtiari Province using multivariate statistical methods

The temporal and spatial vegetation dynamics is highly dependent on many different environmental and biophysical factors. Among these, climate is one of the most important factors that influence the growth and condition of vegetation. Of the abiotic factors affecting the geographic distribution of vegetation type, climate is probably the most important. Ecological research has traditionally aim...

Publication date: 1967